Exploratory Data Analysis: Wisconsin Diagnostic Breast Cancer (WDBC)¶

1.1 Introduction¶

This report analyzes the Wisconsin Diagnostic Breast Cancer (WDBC) dataset to identify key features distinguishing malignant from benign tumors. The data features were computed from digitized images of fine needle aspirates (FNA) of breast masses, describing the characteristics of the cell nuclei present in the image (Wolberg et al., 1995).

1.2 Data Acquisition¶

The raw data was retrieved directly from the UCI Machine Learning Repository to ensure reproducibility. The dataset consists of 569 instances with 30 real-valued input features and one binary target variable (Diagnosis).

In [1]:
# Imports (only those necessary for EDA)
import pandas as pd
import numpy as np

import altair_ally as aly
import altair as alt
alt.data_transformers.enable('vegafusion')

from ucimlrepo import fetch_ucirepo
In [2]:
# import the data
# Code from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
# Need ucimlrepo package to load the data
raw_data = fetch_ucirepo(id=17)

raw_X = raw_data.data.features
raw_y = raw_data.data.targets

raw_df = pd.concat([raw_X, raw_y], axis=1)
raw_df.to_csv("../data/raw/breast_cancer_raw.csv", index=False)

2. Data Cleaning and Schema Mapping¶

The raw dataset lacks semantic column headers. To facilitate analysis, we implemented a schema mapping strategy based on the wdbc.names metadata. The 30 features represent ten distinct cell nucleus characteristics (e.g., Radius, Texture) computed in three statistical forms.

We applied the following suffix mapping transformation:

  • Mean Value: Suffix 1 -> _mean
  • Standard Error: Suffix 2 -> _se
  • Worst (Max) Value: Suffix 3 -> _max

This step ensures all features are semantically interpretable for the subsequent EDA.

In [3]:
# Data Cleaning
# There are no missing values in the dataset

# Clean the column names based on description
clean_columns = []
for col in raw_X.columns:
    if col.endswith('1'):
        clean_name = col[:-1] + '_mean'
    elif col.endswith('2'):
        clean_name = col[:-1] + '_se'
    elif col.endswith('3'):
        clean_name = col[:-1] + '_max'
    else:
        clean_name = col
    
    clean_columns.append(clean_name)
X = raw_X.copy()
X.columns = clean_columns

# Clean the target column
y = raw_y.copy()
y['Diagnosis'] = y['Diagnosis'].map({'M': 'Malignant', 'B': 'Benign'})
clean_df = pd.concat([X, y], axis=1)

# Export the cleaned data
clean_df.to_csv('../data/processed/breast_cancer_cleaned.csv', index=False)

clean_df
Out[3]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean ... texture_max perimeter_max area_max smoothness_max compactness_max concavity_max concave_points_max symmetry_max fractal_dimension_max Diagnosis
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.16220 0.66560 0.7119 0.2654 0.4601 0.11890 Malignant
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.12380 0.18660 0.2416 0.1860 0.2750 0.08902 Malignant
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.14440 0.42450 0.4504 0.2430 0.3613 0.08758 Malignant
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.20980 0.86630 0.6869 0.2575 0.6638 0.17300 Malignant
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.13740 0.20500 0.4000 0.1625 0.2364 0.07678 Malignant
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 0.05623 ... 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115 Malignant
565 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 0.05533 ... 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637 Malignant
566 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 0.05648 ... 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820 Malignant
567 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 0.07016 ... 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400 Malignant
568 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 0.05884 ... 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039 Benign

569 rows × 31 columns

3. Data Profiling: Structure and Statistics¶

Purpose:

  • df.info(): Used to verify data integrity by checking for null values and ensuring all feature columns are of float64 type.
  • df.describe(): Used to examine the central tendency and spread of numeric features. This highlights differences in magnitude (scales) across variables.

Observation: The dataset is complete (no missing values). However, describe() reveals massive scale disparities (e.g., area_mean ranges up to 2500, while smoothness_mean is < 0.1), confirming the necessity for Feature Scaling (Standardization) before modeling.
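The scale-disparity point above can be sketched with a toy example (illustrative values standing in for an area-like and a smoothness-like column, not the actual WDBC data): StandardScaler rescales every column to mean 0 and unit standard deviation, removing the magnitude gap.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (illustrative values only)
X = np.array([
    [1001.0, 0.118],
    [1326.0, 0.085],
    [386.1,  0.143],
    [1297.0, 0.100],
])
print(X.std(axis=0))          # raw spreads differ by ~4 orders of magnitude

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~0 for both columns after standardization
print(X_scaled.std(axis=0))   # 1 for both columns after standardization
```

Without this step, distance- and margin-based models (such as the SVC fit later) would be dominated by the large-magnitude area features.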

In [4]:
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   radius_mean             569 non-null    float64
 1   texture_mean            569 non-null    float64
 2   perimeter_mean          569 non-null    float64
 3   area_mean               569 non-null    float64
 4   smoothness_mean         569 non-null    float64
 5   compactness_mean        569 non-null    float64
 6   concavity_mean          569 non-null    float64
 7   concave_points_mean     569 non-null    float64
 8   symmetry_mean           569 non-null    float64
 9   fractal_dimension_mean  569 non-null    float64
 10  radius_se               569 non-null    float64
 11  texture_se              569 non-null    float64
 12  perimeter_se            569 non-null    float64
 13  area_se                 569 non-null    float64
 14  smoothness_se           569 non-null    float64
 15  compactness_se          569 non-null    float64
 16  concavity_se            569 non-null    float64
 17  concave_points_se       569 non-null    float64
 18  symmetry_se             569 non-null    float64
 19  fractal_dimension_se    569 non-null    float64
 20  radius_max              569 non-null    float64
 21  texture_max             569 non-null    float64
 22  perimeter_max           569 non-null    float64
 23  area_max                569 non-null    float64
 24  smoothness_max          569 non-null    float64
 25  compactness_max         569 non-null    float64
 26  concavity_max           569 non-null    float64
 27  concave_points_max      569 non-null    float64
 28  symmetry_max            569 non-null    float64
 29  fractal_dimension_max   569 non-null    float64
 30  Diagnosis               569 non-null    object 
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
In [5]:
clean_df.describe()
Out[5]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean ... radius_max texture_max perimeter_max area_max smoothness_max compactness_max concavity_max concave_points_max symmetry_max fractal_dimension_max
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns

4. Correlation Analysis: Pearson vs. Spearman¶

Method:

  • Pearson Correlation: Measures linear relationships.
  • Spearman Correlation: Measures monotonic rank relationships (non-linear). Comparing both helps identify if relationships are strictly linear or just trending in the same direction.

Purpose: To detect Multicollinearity—redundant features that increase model complexity without adding information.

Results: Both metrics show near-perfect correlation ($>0.95$) between Radius, Perimeter, and Area. This confirms these features are geometrically redundant. We should retain only one (e.g., Radius) and drop the others to improve model stability.
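The geometric origin of this redundancy can be reproduced with synthetic radii (a sketch, not the WDBC measurements): perimeter is linear in radius, so Pearson is essentially 1, while area is a monotone but non-linear function of radius, so Spearman is exactly 1 even where Pearson dips slightly below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
r = rng.uniform(7, 28, size=200)     # radius range similar to radius_mean

geo = pd.DataFrame({
    "radius": r,
    "perimeter": 2 * np.pi * r,      # linear in r  -> Pearson ~ 1
    "area": np.pi * r ** 2,          # monotone in r -> Spearman = 1
})

print(geo.corr(method="pearson").round(3))
print(geo.corr(method="spearman").round(3))
```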

In [6]:
# Multicollinearity

corr_chart = aly.corr(clean_df)

corr_chart.save('../results/images/corr_chart.png')
corr_chart.save('../results/images/corr_chart.svg')

corr_chart
Out[6]:

5. Pairwise Separability Analysis¶

Purpose: To visualize 2D decision boundaries. We look for feature combinations where the Benign (Blue) and Malignant (Orange) clusters are clearly distinct with minimal overlap.

Results:

  • High Separability: Features related to size (radius_mean) and shape complexity (concavity_mean) separate the classes well.
  • Non-linear patterns: The curved relationship between area and radius is clearly visible, reinforcing the geometric redundancy found in the correlation analysis.
In [7]:
# Only include the _mean features, as they carry most of the information
cols_mean = [c for c in clean_df.columns if '_mean' in c] + ['Diagnosis']
pair_chart = aly.pair(clean_df[cols_mean], color='Diagnosis:N')

pair_chart.save('../results/images/pair_chart.png')
pair_chart.save('../results/images/pair_chart.svg')

pair_chart
Out[7]:

6. Distribution Analysis¶

Purpose: To inspect the univariate "shape" of the data. We look for Skewness (asymmetry) and Outliers that could bias linear models.

Results:

  • Skewness: Features like area_se and concavity_mean are heavily right-skewed (long tail to the right). This suggests a Log Transformation to bring these distributions closer to symmetric.
  • Overlap: "Texture" and "Smoothness" show high overlap between classes, suggesting they are less informative on their own compared to "Size" features.
In [8]:
dist_chart = aly.dist(clean_df, color='Diagnosis')

dist_chart.save('../results/images/dist_chart.png')
dist_chart.save('../results/images/dist_chart.svg')

dist_chart
Out[8]:

EDA Findings¶

  • Class Separation:
    • High Separability: Features related to size (radius, perimeter, area) and concavity (concave_points, concavity) show clear distinction between Benign and Malignant classes (Malignant samples generally have higher values).
    • Low Separability: Texture, Smoothness, and Fractal Dimension show significant overlap, indicating they are weaker individual predictors.
  • Distributions:
    • Skewness: "Area" and "Concavity" features (both _mean and _se) are heavily right-skewed.
    • Outliers: Visible in the upper tails of area_max and perimeter_se.
  • Correlations (Multicollinearity):
    • Severe Multicollinearity: radius, perimeter, and area are almost perfectly correlated ($R \approx 1$). This is expected geometrically but redundant for models.
    • concavity, concave_points, and compactness also exhibit very high positive correlation.

Preprocessing Recommendations¶

Based on the above, the following pipeline is suggested:

  1. Feature Selection / Drop:
    • Remove redundant features to reduce multicollinearity: keep radius and drop perimeter and area, as they duplicate the same size information.
  2. Transformation:
    • Apply Log Transformation to skewed features (e.g., area, concavity) to normalize distributions.
  3. Scaling:
    • Features vary vastly in scale (e.g., area > 1000 vs. smoothness < 0.2). Use StandardScaler to standardize all features to unit variance.
  4. Imputation:
    • None needed (Data is clean).
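The transformation step can be sketched with synthetic right-skewed data standing in for features like area_se (a minimal illustration, not the WDBC values): np.log1p compresses the long right tail, pulling the skewness toward zero.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
x = rng.lognormal(mean=3.0, sigma=0.8, size=569)  # heavy right tail

print(round(skew(x), 2))             # strongly positive before the transform
print(round(skew(np.log1p(x)), 2))   # much closer to symmetric afterwards
```

log1p (i.e., log(1 + x)) is preferred over a plain log here because several WDBC features (e.g., concavity_mean) contain exact zeros.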

Onto Creating a Classification Model¶

In [9]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = clean_df.drop('Diagnosis', axis=1)
y = clean_df['Diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
In [10]:
X_train.columns
Out[10]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_max', 'texture_max', 'perimeter_max',
       'area_max', 'smoothness_max', 'compactness_max', 'concavity_max',
       'concave_points_max', 'symmetry_max', 'fractal_dimension_max'],
      dtype='object')
In [11]:
numeric_feats = ['radius_mean', 'texture_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_max', 'texture_max', 
       'smoothness_max', 'compactness_max', 'concavity_max',
       'concave_points_max', 'symmetry_max', 'fractal_dimension_max']

drop_feats = [
    'perimeter_mean',
    'area_mean',
    'perimeter_se',
    'area_se',
    'texture_se',
    'smoothness_se',
    'symmetry_se',
    'perimeter_max',
    'area_max'
]
In [12]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

ct = make_column_transformer(    
    (StandardScaler(), numeric_feats), 
    ("drop", drop_feats)
)

pipe = Pipeline([
    ("preprocess", ct),
    ("svc", SVC())
])

param_grid = {
    "svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100]
}

gs = GridSearchCV(
    estimator = pipe,
    param_grid = param_grid,
    cv = 15,
    n_jobs = -1,
    return_train_score = True
)

gs.fit(X_train, y_train)
Out[12]:
GridSearchCV(cv=15,
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(transformers=[('standardscaler',
                                                                         StandardScaler(),
                                                                         ['radius_mean',
                                                                          'texture_mean',
                                                                          'smoothness_mean',
                                                                          'compactness_mean',
                                                                          'concavity_mean',
                                                                          'concave_points_mean',
                                                                          'symmetry_mean',
                                                                          'fractal_dimension_mean',
                                                                          'radius_se',
                                                                          'texture_se',
                                                                          'smoothness_se',
                                                                          'compactness_se',
                                                                          'concavity_se',
                                                                          'con...
                                                                          'concavity_max',
                                                                          'concave_points_max',
                                                                          'symmetry_max',
                                                                          'fractal_dimension_max']),
                                                                        ('drop',
                                                                         'drop',
                                                                         ['perimeter_mean',
                                                                          'area_mean',
                                                                          'perimeter_se',
                                                                          'area_se',
                                                                          'texture_se',
                                                                          'smoothness_se',
                                                                          'symmetry_se',
                                                                          'perimeter_max',
                                                                          'area_max'])])),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.001, 0.01, 0.1, 1.0, 10, 100],
                         'svc__gamma': [0.001, 0.01, 0.1, 1.0, 10, 100]},
             return_train_score=True)
In [13]:
results = pd.DataFrame(gs.cv_results_)

best_performing = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].sort_values(
    by='mean_test_score', ascending=False
).head(10)

heatmap_data = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].copy()
heatmap_data['C'] = heatmap_data['param_svc__C'].astype(str)
heatmap_data['gamma'] = heatmap_data['param_svc__gamma'].astype(str)

heatmap = alt.Chart(heatmap_data).mark_rect().encode(
    x = alt.X('gamma:N', title='gamma'),
    y = alt.Y('C:N', title='C'),
    color = alt.Color('mean_test_score:Q', scale=alt.Scale(scheme='viridis')),
    tooltip = ['C', 'gamma', 'mean_test_score']
).properties(
    width = 400,
    height = 400,
    title = 'SVM GridSearchCV Mean Test Scores'
)
In [14]:
best_performing
Out[14]:
param_svc__C param_svc__gamma mean_test_score
25 10 0.01 0.969176
31 100 0.01 0.966667
30 100 0.001 0.960287
19 1.0 0.01 0.955986
24 10 0.001 0.955914
20 1.0 0.1 0.955914
26 10 0.1 0.953620
32 100 0.1 0.951470
18 1.0 0.001 0.931613
14 0.1 0.1 0.927455
In [15]:
heatmap.display()
In [16]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = gs.predict(X_test)

report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose().drop('support', axis = 1).drop(['macro avg', 'weighted avg'])
report_df
Out[16]:
precision recall f1-score
Benign 0.986486 1.000000 0.993197
Malignant 1.000000 0.975610 0.987654
accuracy 0.991228 0.991228 0.991228
In [17]:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index = gs.classes_, columns = gs.classes_)

cm_melted = cm_df.reset_index().melt(id_vars='index')
cm_melted.columns = ['Actual', 'Predicted', 'Count']

heatmap = alt.Chart(cm_melted).mark_rect().encode(
    x = alt.X('Predicted:N', title = 'Predicted'),
    y = alt.Y('Actual:N', title = 'Actual'),
    color = alt.Color('Count:Q', scale = alt.Scale(scheme ='viridis'))
).properties(
    width = 400,
    height = 400,
    title = 'Confusion Matrix Heatmap'
)

text = alt.Chart(cm_melted).mark_text(color = 'white').encode(
    x = 'Predicted:N',
    y = 'Actual:N',
    text = 'Count:Q'
)

heatmap + text
Out[17]:
In [ ]: